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in the whole ROLAP processing. The cost of the queries can be effectively reduced 
by estimating the query results using samples. To maintain the accuracy of ROLAP 
even when using samples, the samples need to be extracted in an appropriate 
unit. However, conventional query optimization methods only support record based 
sampling and cannot be applied for complex queries that have other sampling 
units, such as the ones that include grouping aggregate operations. We develop a 
query optimization method named SUPRA that preserves the sampling unit used in 
random data extraction. The method is designed to preserve both the sampling 
unit and the randomness of the sampling operation. Using this method, typical 
ROLAP queries can be transformed into more efficient ones than those obtained 
through conventional methods 
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Abstract 

Relational online analytical processing (ROLAP) re- 
duces the amount of storage required for maintaining vari- 
ous sizes of data cubes by materializing only parts of them 
in a lazy evaluation manner. In ROLAP, however, cube- 
creation queries need to be issued repeatedly in order to 
search for useful features (i.e. rules or patterns) within 
large-scale databases. The cube-creation cost can be a 
bottleneck in the whole ROLAP processing. The cost of 
the queries can be effectively reduced by estimating the 
query results using samples. To maintain the accuracy 
of ROLAP even when using samples, the samples need 
to be extracted in an appropriate unit. However, con- 
ventional query-optimization methods only support record- 
based sampling and cannot be applied for complex queries, 
that have other sampling units, such as the ones that in- 
clude grouping aggregate operations. In this paper, we de- 
velop a query-optimization method named SUPRA that pre- 
serves the sampling unit used in random data extraction. 
The method is designed to preserve both the sampling unit 
and the randomness of the sampling operation. Using this 
method, typical ROLAP queries can be transformed into 
more efficient ones than those obtained through the conven- 
tional methods. 



1. Introduction 

In recent years, the importance of data warehouse (DW) 
systems has been widely recognized, and many large-scale 
DW systems have been built for business purposes. On- 
line analytical processing (OLAP) [1, 4, 5] is one of the 
most practical analysis methods used with the DW sys- 
tems. The OLAP tries to find useful features from the multi- 
dimensional summary data usually called data cubes. Since 
a cube is prepared for each combination of analytical at- 
tributes, a large amount of storage is required for keeping 
various sizes of data cubes in advance. Relational-OLAP 
(ROLAP) is an effective approach to reduce the amount of 
storage by materializing parts of the data cubes in a lazy 
evaluation manner. In practical situations, however, in or- 
der to obtain data-cubes including useful features, appro- 
priate combinations of analytical attributes and also appro- 



priate preconditions for the records of the data cubes should 
be determined. Therefore, the ROLAP approach requires a 
number of cube-creation queries to be issued repeatedly in 
order to find appropriate features within the databases [3]. 
To make the entire analysis process efficient, each cube- 
creation query should be processed rapidly. 

One promising approach to enable rapid cube creation is 
to use sampling instead of using the whole database. The 
analyst can estimate the result of the query using sampled 
data and determine the parameters of the next cube-creation 
query based on the estimation. The reduction of process- 
ing data shortens the processing time of each query and the 
analyst can obtain the target features more quickly. In op- 
timization of sampling queries, the processing time can be 
reduced drastically by applying the sampling operation at 
an early stage of the query processing. However in business 
DW systems, because the data is shared among the vari- 
ous offices, the unit of the record used for storing does not 
always agree with the unit of the record needed for the anal- 
ysis. For instance, suppose we have a database that includes 
a sales table as shown in Figure 1 . 
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Figure 1. Conversion of the unit of records 

In this table, the unit of stored records is the same as the 
unit of an order. If we want to estimate the average order- 
price of all customers, we cannot simply sample the records 
from this table. The records having the same customer num- 
ber need to be sampled together as one unit, and the average 
should be estimated using these records. However, the con- 
ventional query optimization methods [7, 8, 10] only sup- 
port record-based sampling and do not consider the sam- 
pling unit of the sampling operation. Because a grouping 
aggregate operation converts the unit of records in its pro- 
cessing, the conventional methods cannot be applied for the 
queries including the grouping aggregate operations. This 



limitation of the conventional methods prevents further op- 
timization of the sampling queries in practical use. 

In this paper, we propose a new query optimiza- 
tion method named SUPRA (sampling unit preservation 
method) which preserves the sampling unit during the op- 
timization of the sampling operations. This method en- 
ables preservation of the sampling unit by adopting a spe- 
cial sampling operation in which all the records speci- 
fied as included in the same sampling unit are guaranteed 
to be extracted all together within the resulting samples. 
The method preserves both the sampling unit and the ran- 
domness of data extraction through its query optimization. 
When the cube-creating sampling queries are issued to the 
database, the query optimization module of the database 
system automatically identifies the sampling units of the 
queries and transforms them into equivalent ones in which 
the application point of the sampling operations have been 
moved forward. 

2. Sampling units of sampling queries 

In business DW systems, various kinds of operational 
data are collected from throughout the company, and the 
collected data are shared among offices having different 
purposes on data analysis. A sampling operation used for 
the analysis must extract records in an unit suitable for the 
purpose of the analysis. For instance, a query (Figure 2) is 
issued to the table (Figure 1 ) in order to estimate the average 
order-price of the customers. The RANDOM keyword in 
the query indicates the AVG operation should be processed 
using samples. In this example, the records having the same 
customer number need to be sampled all together as one 
sampling unit of the sampling operation (Figure 3(a)). 



SELECT AVG( /WM>0M(OrderPrice) ) 

FROM SELECT Customer, SUM(Price) AS OrderPrice 

FROM SalesTable 

GROUP BY Customer; 



Figure 2. Sampling query 

Unfortunately, the conventional methods [7, 8, 10] as- 
sume record-based sampling for its optimization and can- 
not be applied to queries that include operations other than 
record-based sampling. If the sampling unit does not agree 
with the unit for analysis, a proper estimation is no longer 
expected from the sampled records. Suppose we sample 
two customers at random from the above table and sum up 
the prices of them then calculate the average of both cus- 
tomers. The expected value of the average order-price is 
13000. This is the correct answer. However, if we apply 
record-based sampling to the table (Figure 3(b)) and sample 
four orders (two orders for each two customers) from the ta- 
ble at random, the expected value of this case is 9533.3- * ■. 
As shown like this, because the semantics of the sampling 
operation no longer agrees with the purpose of the analy- 
sis, we cannot draw the correct estimation from this ap- 
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(a) Customer-based sampling 
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(b) Record-based sampling 



Figure 3. Appropriate sampling unit 

proach. A method for sampling-query optimization that 
can preserve the sampling unit is needed to realize accurate 
sampling-based ROLAP processing. 

3. SUPRA: preservation of the sampling units 

In our approach, in order to preserve a sampling unit 
through the query optimization, a special sampling oper- 
ation is introduced as a new basic operation. In the fol- 
lowing, we first define the semantics of this sampling op- 
eration, then we explain query transformation procedures 
of the sampling operation with other basic database oper- 
ations. We also give an example of the query optimization 
and an efficient random sampling technique from a database 
table. We represent the table that results from an operation 
F to a table T by an execution formula F(T). 

3.1. Definition of the sampling operation 

SAMPLING operation: In our SAMPLING operation 
S(SC,SGC,T), two parameters are introduced to keep a sam- 
pling unit during the query transformation : sampling col- 
umn (SC) and sampling grouping column (SGC). The se- 
mantics of the operation is defined by these parameters. 

(i) If the sampling column is specified as a list of column 
names, the operation assigns a value randomly for each 
sampling columns and extracts all the records having 
the specified values in their sampling columns. 

(ii) If the sampling column is specified as "not specified", 
the operation extracts the records on a record basis. 

(iii) If the sampling column becomes an empty set, the op- 
eration extracts all the records from the table. 

The sampling operation is repeatedly applied to the table 
until sufficient number of samples are obtained. 

In addition, another set of column names can be specified 
as the sampling grouping column (SGC) of the SAMPLING 
operation. In that case, the records in table T are classified 



into groups according to the value of the sampling group- 
ing column, and records within each group having the same 
value in their sampling columns are regarded as one sam- 
pling unit. The extraction probability of each sampling unit 
needs to be the same within each group, while the probabil- 
ity can differ among the groups (see Figure 4). 
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Figure 4. Sampling parameters 

3.2. Query optimization procedures 

In general, a sampling query can be optimized by push- 
ing down the sampling operation and reducing the number 
of records to be processed. In the following, formulas for 
query transformation procedures of basic database opera- 
tions are described. In each description, = denotes equality 
of sampling operations on both sides of the formula. We de- 
fine the equivalence of the sampling operations as satisfying 
following two conditions through the query transformation. 

(CI) The sampling unit is preserved. 

(C2) The extraction probability of each sampling unit is pre- 
served. 

The query transformation begins with insertion of a sam- 
pling operation into the sampling query just before its ag- 
gregate operation. At the insertion time, the SC of the op- 
eration is specified as "not specified", and the SGC is spec- 
ified as the same as the grouping column of the aggregate 
operation (if it exists). In the following, proofs are given 
for the cases when the SC is specified other than "not speci- 
fied". The proofs for the record-based sampling are already 
given in the previous work [7, 8, 10] (except for the group- 
ing aggregate operation). 

Definition: Rsc%^ denotes a set of records extracted as 
one sampling unit by the execution of the left (or right) side 
of the formula assuming the sampling parameters are given 
as SC = sc and SGC = sgc. (sc and sgc are lists of randomly 
specified values.) 

(1) PROJECT operation 

The PROJECT operation P(PC, T) extracts the projection 
columns (PC) from table T. In this paper, we assume that 
the operation does not eliminate duplicates. A query in- 



cluding the PROJECT operation can be optimized by sim- 
ply exchanging its application order. 

S(SC, SGC, P(PC, T)) = P(PC, S(SC, SGC, T)) 

Proof: (CI) Suppose a record r € Rs C>agc - Because r sat- 
isfies SC = sc, r is also extracted in the right side of the for- 
mula (here we assume SC C PC) (i.e. Rf CtSgc C R? CtSgc ). 
The same holds true for the right side (i.e. R^ 8gc 2 
#?c, S9 c)- Thus it follows that R^ sgc = Rf c>sgc . 
(C2) From the discussion above, a sampling unit and the 
value of the SC have one-to-one correspondence. Thus if 
we apply the same sampling procedure for both sides of 
the formula, the extraction probability of the sampling units 
will be preserved. 

(2) SELECT operation 

The SELECT operation C(Pred, T) extracts records satis- 
fying a selection condition Pred from table T. A query in- 
cluding the SELECT operation can be optimized by simply 
exchanging its application order. 

S{SC, SGC, C(Pred, T)) = C{Pred, S{SC, SGC, T)) 

Proof: (CI) Suppose a record r G Rs Ct s 9 c- Because r sat- 
isfies SC = sc and the condition Pred, r is also extracted in 
the right side of the formula. The same holds true for the 
right side. Thus it follows that Rs Cy9gc = R? c ,sgc- 
(C2) The same reason with the previous operation. 



(3) JOIN operation 

The JOIN operation J(JC, S, T) joins two records from ta- 
bles S and T, if they have the same value for the join col- 
umn (JC). Here we assume equi-join. A query including the 
JOIN operation can be optimized by exchanging their ap- 
plication order and limiting the sampling columns and the 
sampling grouping columns to respective columns also in- 
cluded in each of the tables. 

S(SC, SGC,J(JC,S,T)) 
= J(JC, S{SC/S, SGC IS, S), S(SC/T, SGC/T, T)) 

Proof: (CI) Suppose a record r € R% CiSgc . The record r 
can be divided into two records having the same JC value 
by limiting the columns of r to respective columns included 
in each of the tables. Because these records satisfies the 
restriction on the SC values of the right side, the records 
also extracted and joined in the right side of the formula. 
The similar discussion also holds true for the right side. As 
a result, we can conclude that JSf C)5ffC = Rf c ,sgc- 
(C2) The same reason with the previous operation. 

In this query transformation, in order to improve the join 
selection ratio, the same SC values should be specified in 
both of the sampling operations distributed to the respective 
tables. 
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Figure 5. Transformation of JOIN query 

(4) GROUPING AGGREGATE operation 

The GROUPING AGGREGATE operation A(AC, GC, T) 
classifies the records of the table T into groups according to 
the value of grouping columns (GC), then calculates the ag- 
gregate value (such as summation or average) of the aggre- 
gate columns (AC) for each group. A query including the 
GROUPING AGGREGATE operation can be optimized by 
simply exchanging its application order if SC C GC (except 
the case when the SC is specified as "not specified")- 

S(SC, SGC,A(AC,GC,T)) 

= A(AC, GC, S{SC, SGC,T)) 

Proof: (Cl) The grouping aggregate operation does not 
change the SC value if SC C GC. Suppose a record r € 
Rscsgc- Because r satisfies SC = sc, the record also ex- 
tracted in the right side of the formula. The similar discus- 
sion also holds true for the right side. Thus it follows that 

pL _ pR 

^sc.sgc ~~ -^sc^sgc 

(C2) The same reason with the previous operation. 

When the value of the sampling columns specified as 
"not specified", the resulting records of grouping aggrega- 
tion are sampled on a record basis. Because the unit of 
a record changes during this operation, the conventional 
methods cannot cope with this case. The records having the 
same value in the grouping columns should be extracted as 
one sampling unit. In this case, we will apply the previous 
transformation procedure regarding the grouping columns 
as the sampling columns (Figure 6). Because each record 
in the results of the grouping aggregate operation and the 
value of the grouping column have one-to-one correspon- 
dence, the sampling unit of the sampling query can be han- 
dled correctly by this conversion. 

In addition, if the sampling grouping columns (SGC) 
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Figure 6. Transformation of GROUPING AG- 
GREGATE query 

are specified and satisfies SGC C SC, we can substitute 
the sampling columns of the operation with the difference 
between the sampling columns and the sampling grouping 
columns (SC-SGC) (Figure 7). 

S(SC, SGC, T) = S(SC - SGC, SGC, T) 

Decrease in the number of sampling columns reduces the 
CPU cost at the data extraction time. In the following, we 
will examine the equivalence of the sampling unit within a 
group constructed by the sampling grouping columns. 
Proof: (Cl) Suppose a record r € Rs Ct8gc - Because r 
satisfies SC = sc and SGC = sgc, the record also satisfies 
the restriction on the SC values of the right side. Thus the 
record is also extracted and classified into the same group 
in the right side of the formula. Similarly, if we assume 
r € R?c,sgo men r satisfies the restriction on the SC values 
except for the SGC values (SC-SGC=sc-s#c), and also sat- 
isfies SGC = sgc. From the above two facts, it follows that r 
satisfies SC = sc. Therefore, the record r also extracted and 
classified into the same group in the left side of the formula. 
Hence we can conclude that R% Ct8gc = R? c ,sgc- 
(C2) The same reason with the previous operation. 
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Figure 7. Transformation of SAMPLING query 



3.3. An example of query transformation 

This subsection shows an example of the query trans- 
formation. Suppose a database consists of three tables : a 
customer table, an order table, and a product table. These ta- 
bles store the information generated through a typical sales 
management work. The custID column is the key column of 



the customer table. The orderlD column is the key column 
of the order table, and the custID column of the order table 
is a foreign key column of the customer table. The orderlD 
column of the product table is a foreign key column of the 
order table. Then, suppose a query shown in Figure 8 is is- 
sued to the database to estimate the average order-price of 
customers for each customer category. The execution for- 
mula for this query is given as Figure 9. 



Figure 8. Complex sampling query 



A(OrdcrPricc, {Segment, Priority, Shipmode}, 
S(NotSpocificd, {Segment, Priority, Shipmode}, 
A(Price, {OrderlD, Segment, Priority, Shipmode}, 
J({OrdcrlD},ProductTable, 

J({CustomcriD}, CustomerTabic, OrderTablc)))) 



Figure 9. Execution formula for the query 

When we apply our optimization method to the exam- 
ple, the application order of the sampling operation is ex- 
changed with the grouping aggregate operation and the join 
operations step-by-step. Figure 1 0 shows the execution for- 
mula of the transformed example query. The sampling oper- 
ation is moved beyond the operations and is applied directly 
to the product and order tables. This reduces the number of 
records to be processed drastically. 



A(OrderPrice, {Segment, Priority, Shipmode}, 
A(Price, {OrderlD, Segment, Priority, Shipmode}, 

J({OrderID}, S({Order[D}, {Segment, Priority, Shipmode}, ProductTable), 
J({CustomerID}, CustamerTabie, 

S({ OrderlD}, {Segment, Priority, Shipmode}, OrderTable))))) 



Figure 10. Transformed query 

3.4. Efficient random sampling 

After the query optimization, sampling operations are 
applied directly to the tables. To extract records from the 
tables efficiently while preserving the sampling units, we 
adopted a cluster sampling method using hash functions. 
In this method, several "partition columns" are specified 
for each table in advance and all records in the table are 
classified into buckets according to their hash values of the 
partition columns. To each of the buckets, several blocks 
(fixed-size continuous areas) of the disk are allocated, and 
the records in a bucket are stored in these blocks. At the 
time of data extraction, a hash value is selected randomly 
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Table 1. Experimental Parameters 



for each of the sampling columns and all the records in a 
bucket having the specified hash value are extracted from 
the database as samples. If the partition columns and the 
sampling columns are the same, we can extract records from 
the buckets instead of evaluating hash values of the records. 
When some sampling columns do not agree with the parti- 
tion columns, the hash functions are used on those remain- 
ing columns. If the sampling column is specified as "not 
specified", the records are sampled an a record basis using 
conventional methods. For instance, the OrderlD column 
can be used as a partition column for the example given in 
the previous subsection. Using this method, we do not have 
to evaluate hash values of all the records nor have to access 
the table randomly using an index. 

4. Performance issues 

This section presents the experimental results of apply- 
ing our SUPRA method to a typical ROLAP query. We 
have implemented our SUPRA method on the HiRDB (a 
commercial parallel database system of Hitachi) running on 
the SR4300 (SP2 comparable) share-nothing-type parallel 
computer (CPU: Power2, 66 MHz) executing the AIX 4.1 

05. We used TPC-D [12] as a data model for the experi- 
ments. The query shown in Figure 8 was used as an exam- 
ple query. Experimental parameters are listed in Table 1 . 
We compared the following four sampling methods. 

1) Conventional method 

(a) Without an index Samples are extracted from the 

order table and the product table independently, 
and the records are joined with the customer ta- 
ble using a sort-merge method. However, this 
method estimate the query result incorrectly. 

(b) With indexes Samples extracted from the order 

table are joined with the product table and the 
customer table via indexes using a nest-loop 
method. In this method, because the sampling 
unit of the query is guaranteed by the indexes, 
the correct estimation can be obtained. 

2) SUPRA method 

(a) Optimization only Samples are extracted from 

the order table and the product table according 
to the hash value of the records, and the resulting 
records are joined with the customer table using 
a sort-merge join. 

(b) Optimization + hash partitioning Samples are 

extracted from the order table and the product 
table according to the hash values of the buckets, 



SELECT Segment, Priority, Shipmode, AVG(RANDOM(OrderPrice)) 
FROM SELECT OrderlD, Segment, 

Priority, Shipmode,SUM(Price)AS OrderPrice 
FROM CustomerTable, OrderTable, ProductTable 
WHERE CustomerTable.CustomerlD = OrderTable.CustomerlD 

AND OrderTable.OrderlD = ProductTable.OrderlD 
GROUP BV OrderlD, Segment, Priority, Shipmode 
GROUP BY Segment, Priority, Shipmode; 



and the resulting records are joined with the 
customer table using a sort-merge join. 

The following graph shows the relative sample extrac- 
tion time comparing to the time when no sampling opera- 
tion was included in the query. Figure 1 1 indicates that our 
query optimization method (method 2(a)) reduces the query 
processing time to about half of that when using the conven- 
tional method without an index (method 1(a)). When we 
combine our hash partitioning method to the previous op- 
timization method, the resulting SUPRA method (method 
2(b)) is twice as fast as the conventional method using the 
indexes (method 1(b)). Over 10,000 records are typically 
required for a multi-dimensional OLAP analysis (assuming 
the required accuracy is 99%). Our results confirm the supe- 
riority of our SUPRA methods for typical ROLAP queries. 
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Figure 11. Execution time of record sampling 



5. Related work 

Hellerstein et al. [6] tried to process online aggrega- 
tion queries by using sampling, and they claimed that ac- 
curate computation of the entire data cube is not necessary; 
approximations of the aggregates seem to be sufficient in 
many situations. They do not mention about the query op- 
timization of sampling queries. On the other hand, a num- 
ber of sampling methods [2, 1 1 , 9] to extract records from 
database files have been proposed, however, these methods 
do not consider the relational queries. 

The most immediately related method of our approach 
is described in the literature [7, 8, 10]. This conventional 
method provides query transformation procedures that ex- 
change the application order between the sampling opera- 
tion and each of the basic database operations, which only 
preserve the record- level extraction probability. In this 
method, the sampling units of the sampling operations are 
not considered and the method cannot be applied to sam- 
pling queries including record- unit converting operations, 
such as grouping aggregate operations. On the contrary, our 



query optimization method is designed to preserve the sam- 
pling unit during the query optimization, and can be used 
when the unit of storage and the unit of analysis do not 
agree. In addition, our method adopts a hash partitioning 
to extract sample records efficiently from a database table. 

6. Conclusion and future work 

We have developed a query optimization method, which 
preserves the sampling unit of the sampling operation dur- 
ing the query transformation, and a hash partition method to 
extract sample records efficiently form a database table. We 
then presented our experimental results from the evaluation 
of our developed methods. Using these methods, practical 
cube-creating queries can be processed rapidly. 

Future work of our method is to provide a quantitative 
method for determining the optimal partition columns. Us- 
ing our method, we are planning to improve the interactivity 
of our OLAP products. 
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